Setup Your Spark Project

Required Spark Version

LakeSoul currently supports Scala version 2.12 and Spark version 3.3.

Setup (Py)Spark Shell or Spark SQL Shell

To use the spark-shell, pyspark, or spark-sql shells, you need to include LakeSoul's dependencies. There are two ways to do this.

Use Maven Coordinates via --packages

spark-shell --packages com.dmetasoul:lakesoul-spark:2.2.0-spark-3.3
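Beyond the dependency itself, LakeSoul also needs its SQL extension and catalog registered on the Spark session. A fuller invocation might look like the sketch below; the extension and catalog class names here are taken from LakeSoul's documentation, so verify them against the release you are using:

```shell
# Start spark-shell with the LakeSoul package plus its session extension
# and catalog (class names assumed from LakeSoul docs -- verify per version).
spark-shell --packages com.dmetasoul:lakesoul-spark:2.2.0-spark-3.3 \
  --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \
  --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog
```

The same `--conf` flags apply equally to pyspark and spark-sql.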

Use Local Packages

You can find LakeSoul packages on our release page: Releases. Download the jar file and pass it to spark-submit with --jars.

spark-submit --jars "lakesoul-spark-2.2.0-spark-3.3.jar"

Alternatively, you can put the jar directly into $SPARK_HOME/jars.

Setup Java/Scala Project

Include the Maven dependency in your project:

<dependency>
    <groupId>com.dmetasoul</groupId>
    <artifactId>lakesoul-spark</artifactId>
    <version>2.2.0-spark-3.3</version>
</dependency>

Pass lakesoul_home Environment Variable to Your Job

If you are using Spark's local or client mode, you can simply export the environment variable in your shell:

export lakesoul_home=/path/to/lakesoul.properties
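For reference, a minimal lakesoul.properties sketch. The keys below are the PostgreSQL metadata-connection properties used by LakeSoul 2.x; the host, database, and credentials are placeholders, so check your release's documentation for the exact key names and defaults:

```properties
# Placeholder PostgreSQL metadata connection settings (assumed key names).
lakesoul.pg.driver=com.lakesoul.shaded.org.postgresql.Driver
lakesoul.pg.url=jdbc:postgresql://localhost:5432/lakesoul_test?stringtype=unspecified
lakesoul.pg.username=lakesoul_test
lakesoul.pg.password=lakesoul_test
```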

If you are using Spark's cluster mode, in which the driver is also scheduled into the YARN or K8s cluster, you need to set up the driver's environment variable:

  • For Hadoop YARN, pass --conf spark.yarn.appMasterEnv.lakesoul_home=lakesoul.properties --files /path/to/lakesoul.properties to the spark-submit command;
  • For K8s, pass --conf spark.kubernetes.driverEnv.lakesoul_home=lakesoul.properties --files /path/to/lakesoul.properties to the spark-submit command.

In both cases --files ships lakesoul.properties into the driver container's working directory, which is why the environment variable is set to the bare filename rather than a full path.
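Putting the pieces together, a YARN cluster-mode submission might look like the sketch below. The application class and jar names are placeholders, not part of LakeSoul:

```shell
# Hypothetical YARN cluster-mode submission: the env var, properties file,
# and LakeSoul jar are all passed together (com.example.MyApp is a placeholder).
spark-submit \
  --master yarn --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.lakesoul_home=lakesoul.properties \
  --files /path/to/lakesoul.properties \
  --jars lakesoul-spark-2.2.0-spark-3.3.jar \
  --class com.example.MyApp my-app.jar
```

For K8s, swap the env conf for spark.kubernetes.driverEnv.lakesoul_home and use a k8s:// master URL.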